SEGMENTING SPEECH WITHOUT A LEXICON: 
THE ROLES OF PHONOTACTICS AND SPEECH SOURCE 
Abstract 

Infants face the difficult problem of segmenting 
continuous speech into words without the bene- 
fit of a fully developed lexicon. Several sources 
of information in speech might help infants solve 
this problem, including prosody, semantic corre- 
lations and phonotactics. Research to date has 
focused on determining to which of these sources 
infants might be sensitive, but little work has been 
done to determine the potential usefulness of each 
source. The computer simulations reported here 
are a first attempt to measure the usefulness of 
distributional and phonotactic information in seg- 
menting phoneme sequences. The algorithms hy- 
pothesize different segmentations of the input into 
words and select the best hypothesis according to 
the Minimum Description Length principle. Our 
results indicate that while there is some useful 
information in both phoneme distributions and 
phonotactic rules, the combination of both sources 
is most useful. 

INTRODUCTION 

Infants must learn to recognize certain sound se- 
quences as being words; this is a difficult prob- 
lem because normal speech contains no obvious 
acoustic divisions between words. Two sources
of information that might aid speech segmentation
are distribution (the phoneme sequence in
cat appears frequently in several contexts, including
thecat, cats and catnap, whereas the sequence
in catn is rare and appears in restricted contexts)
and phonotactics (cat is an acceptable syllable in
English, whereas pcat is not). While evidence exists
that infants are sensitive to these information
sources, we know of no measurements of their use- 
fulness. In this paper, we attempt to quantify the 
usefulness of distribution and phonotactics in seg- 
menting speech. We found that each source pro- 
vided some useful information for speech segmen- 
tation, but the combination of sources provided 
substantial information. We also found that child-directed
speech was much easier to segment than
adult-directed speech when using both sources.

To date, psychologists have focused on two as- 
pects of the speech segmentation problem. The 
first is the problem of parsing continuous speech 
into words given a developed lexicon to which incoming
sounds can be matched; both psychologists
(e.g., Cutler & Carter, 1987; Cutler & Butterfield,
1992) and designers of speech-recognition
systems (e.g., Church, 1987) have examined this
problem. However, the problem we examined 
is different — we want to know how infants seg- 
ment speech before knowing which phonemic se- 
quences form words. The second aspect psycholo- 
gists have focused on is the problem of determin- 
ing the information sources to which infants are 
sensitive. Primarily, two sources have been ex- 
amined: prosody and word stress. Results sug- 
gest that parents exaggerate prosody in child- 
directed speech to highlight important words (Fer- 
nald & Mazzie, 1991; Aslin, Woodward, LaMen- 
dola & Bever, in press) and that infants are sen- 
sitive to prosody (e.g., Hirsh-Pasek et al., 1987). 
Word stress in English fairly accurately predicts 
the location of word beginnings (Cutler & Norris, 
1988; Cutler & Butterfield, 1992); Jusczyk, Cutler 
and Redanz (1993) demonstrated that 9-month- 
olds (but not 6-month-olds) are sensitive to the 
common strong/weak word stress pattern in En- 
glish. Sensitivity to native-language phonotactics 
in 9-month-olds was recently reported by Jusczyk, 
Friederici, Wessels, Svenkerud and Jusczyk (1993). 
These studies demonstrated infants' perceptive 
abilities without demonstrating the usefulness of 
infants' perceptions. 

How do children combine the information they 
perceive from different sources? Aslin et al. spec- 
ulate that infants first learn words heard in isola- 
tion, then use distribution and prosody to refine 
and expand their vocabulary; however, Jusczyk 
(1993) suggests that sound sequences learned in 
isolation differ too greatly from those in context 
to be useful. He goes on to say, "just how far information
in the sound structure of the input can
bootstrap the acquisition of other levels [of linguistic
organization] remains to be determined." In
this paper, we measure the potential roles of dis- 
tribution, phonotactics and their combination us- 
ing a computer-simulated learning algorithm; the 
simulation is based on a bootstrapping model in 
which phonotactic knowledge is used to constrain 
the distributional analysis of speech samples. 

While our work is in part motivated by the 
above research, other developmental research sup- 
ports certain assumptions we make. The input 
to our system is represented as a sequence of 
phonemes, so we implicitly assume that infants are 
able to convert from acoustic input to phoneme se- 
quences; research by Kuhl (e.g., Grieser & Kuhl, 
1989) suggests that this assumption is reason- 
able. Since sentence boundaries provide informa- 
tion about word boundaries (the end of a sentence 
is also the end of a word), our input contains 
sentence boundaries; several studies (Bernstein- 
Ratner, 1985; Hirsh-Pasek et al., 1987; Kemler 
Nelson, Hirsh-Pasek, Jusczyk & Wright Cassidy, 
1989; Jusczyk et al., 1992) have shown that infants 
can perceive sentence boundaries using prosodic 
cues. However, Fisher and Tokura (in press) found 
no evidence that prosody can accurately predict 
word boundaries, so the task of finding words re- 
mains. Finally, one might question whether in- 
fants have the ability we are trying to model — that 
is, whether they can identify words embedded in 
sentences; Jusczyk and Aslin (submitted) found 
that 7 1/2-month-olds can do so. 

The Model 

To gain an intuitive understanding of our model, 
consider the following speech sample (transcrip- 
tion is in IPA): 

Orthography:    Do you see the kitty?
                See the kitty?
                Do you like the kitty?

Transcription:  dujusiðəkɪti
                siðəkɪti
                dujulaɪkðəkɪti

There are many different ways to break this sample
into putative words (each particular segmentation
is called a segmentation hypothesis). Two
such hypotheses are:

Segmentation 1: du ju si ðə kɪti
                si ðə kɪti
                du ju laɪk ðə kɪti

Segmentation 2: duj us ið əkɪt i
                sið ək ɪti
                du jul aɪk ðək ɪti

Listing the words used by each segmentation hypothesis
yields the following two lexicons:

Segmentation 1
  1 du      3 kɪti    5 si
  2 ðə      4 laɪk    6 ju

Segmentation 2
  1 aɪk     5 ək      9 ɪti
  2 du      6 əkɪt   10 jul
  3 duj     7 i      11 sið
  4 ðək     8 ið     12 us

Note that Segmentation 1, the correct hypothesis,
yields a compact lexicon of frequent words, whereas
Segmentation 2 yields a much larger lexicon of infrequent
words. Also note that a lexicon contains
only the words used in the sample; no words are
known to the system a priori, nor are any carried
over from one hypothesis to the next. Given a lexicon,
the sample can be encoded by replacing words
with their respective indices into the lexicon:

Encoded Sample 1: 1, 6, 5, 2, 3;
                  5, 2, 3;
                  1, 6, 4, 2, 3;

Encoded Sample 2: 3, 12, 8, 6, 7;
                  11, 5, 9;
                  2, 10, 1, 4, 9;

Our simulation attempts to find the hypothesis 
that minimizes the combined sizes of the lexicon 
and encoded sample. This approach is called the 
Minimum Description Length (MDL) paradigm 
and has been used recently in other domains to 
analyze distributional information (Li & Vitanyi, 
1993; Rissanen, 1978; Ellison, 1992, 1994; Brent, 
1993). For reasons explained in the next section,
the system converts these character-based repre- 
sentations to compact binary representations, us- 
ing the number of bits in the binary string as a 
measure of size. 

Phonotactic rules can be used to restrict
the segmentation hypothesis space by preventing
word boundaries at certain places; for instance,
/kætspɔz/ ("cat's paws") has six internal
segmentation points (k ætspɔz, kæ tspɔz, etc.),
only two of which are phonotactically allowed
(kæt spɔz and kæts pɔz). To evaluate the usefulness
of phonotactic knowledge, we compared
results between phonotactically constrained and
unconstrained simulations.

SIMULATION DETAILS 

To use the MDL principle, as introduced above,
we search for the smallest hypothesis. For this
search to work, we need a well-defined method of
measuring hypothesis sizes. A simple, intuitive way
of measuring the size of a hypothesis
is to count the number of characters used
to represent it. For example, counting the characters
(excluding spaces) in the introductory examples,
we see that Hypothesis 1 uses 48 characters
and Hypothesis 2 uses 75. However, this simplistic
method is inefficient; for instance, the lengths of
lexical indices are arbitrary with respect to properties
of the words themselves (e.g., in Hypothesis
2, there is no reason why /jul/ was assigned the
index '10', of length two, instead of '9', of length one).
Our system improves upon this simple size metric
by computing sizes based on a compact representation
motivated by information theory.
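The character counts quoted above can be checked with a short script. This is only an illustrative sketch: the function name `char_size` is hypothetical, the lexicon order is taken from the introductory example, and the IPA symbols are replaced by ASCII stand-ins (D for ð, @ for ə, I for ɪ), which is an assumption made for convenience.

```python
def char_size(lexicon, segmentation):
    """Character count of a hypothesis: lexicon entries plus encoded sample."""
    index = {word: i + 1 for i, word in enumerate(lexicon)}
    # Lexicon: the digits of each index plus the characters of each word.
    lexicon_chars = sum(len(str(index[w])) + len(w) for w in lexicon)
    # Encoded sample: the digits of each index plus one separator (',' or ';').
    sample_chars = sum(len(str(index[w])) + 1
                       for utterance in segmentation for w in utterance)
    return lexicon_chars + sample_chars


# ASCII stand-ins (an assumption): D = eth, @ = schwa, I = small capital i.
lexicon_1 = ["du", "D@", "kIti", "laIk", "si", "ju"]
segmentation_1 = [["du", "ju", "si", "D@", "kIti"],
                  ["si", "D@", "kIti"],
                  ["du", "ju", "laIk", "D@", "kIti"]]

lexicon_2 = ["aIk", "du", "duj", "D@k", "@k", "@kIt",
             "i", "iD", "Iti", "jul", "siD", "us"]
segmentation_2 = [["duj", "us", "iD", "@kIt", "i"],
                  ["siD", "@k", "Iti"],
                  ["du", "jul", "aIk", "D@k", "Iti"]]

print(char_size(lexicon_1, segmentation_1))  # 48
print(char_size(lexicon_2, segmentation_2))  # 75
```

Note that the count for Hypothesis 2 depends on which words happen to receive the two-digit indices, which is exactly the arbitrariness the text objects to.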

We imagine hypotheses represented as a string 
of ones and zeros. This binary string must rep- 
resent not only the lexical entries, their indices 
(called code words) and the coded sample, but 
also overhead information specifying the number 
of items coded and their arrangement in the string 
(information implicitly given by spacing and spa- 
tial placement in the introductory examples). Fur- 
thermore, the string and its components must be 
self-delimiting, so that a decoder could identify the 
endpoints of components by itself. The next sec- 
tion describes the binary representation and the 
length formulae derived from it in detail; readers 
satisfied with the intuitive descriptions presented 
so far should skip ahead to the Phonotactics sub- 
section. 

Representation and Length Formulae 

The representation scheme described below is 
based on information theory (for more examples 
of coding systems, see, e.g., Li & Vitanyi, 1993 
and Quinlan & Rivest, 1989). From this repre- 
sentation, we can derive a formula describing its 
length in bits. However, the discrete form of the 
formula would not work well in practice for our 
simulations. Instead, we use a continuous approx- 
imation of the discrete formula; this approxima- 
tion typically involves dropping the ceiling func- 
tion from length computations. For example, we 
sometimes use a self-delimiting representation for 
integers (as described in Li & Vitanyi, pp. 74-75). 
In this representation, the number of bits needed 
to code an integer x is given by 

$$\ell^{(2)}(x) = 1 + \lceil \log_2(x+1) \rceil + 2 \lceil \log_2 \lceil \log_2(x+1) \rceil \rceil$$

However, we use the following approximation:

$$\ell^{(2)}(x) \approx 1.5 + \log_2(x+1) + 2 \log_2(\log_2(x+2) + 0.5)$$

Using the discrete formula, the difference between
$\ell^{(2)}(126)$ and $\ell^{(2)}(127)$ is zero, while the difference
between $\ell^{(2)}(127)$ and $\ell^{(2)}(128)$ is one bit; using
the continuous formula, the difference between
$\ell^{(2)}(126)$ and $\ell^{(2)}(127)$ is 0.0156, while the difference
between $\ell^{(2)}(127)$ and $\ell^{(2)}(128)$ is 0.0155. We
found it easier to interpret the results using a continuous
function, so in the following discussion, we
will only present the approximate formulae.
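The two length formulae and the differences quoted above can be reproduced numerically. The following is a minimal sketch, assuming the discrete form takes the ceiling of both the outer and the nested logarithm; the function names are hypothetical.

```python
import math


def ceil_log2(n):
    """Exact ceil(log2(n)) for an integer n >= 1, avoiding floating point."""
    return (n - 1).bit_length()


def length_discrete(x):
    """Discrete self-delimiting code length, in bits, for an integer x >= 1."""
    inner = ceil_log2(x + 1)
    return 1 + inner + 2 * ceil_log2(inner)


def length_continuous(x):
    """Continuous approximation of the same length."""
    return 1.5 + math.log2(x + 1) + 2 * math.log2(math.log2(x + 2) + 0.5)


for a, b in [(126, 127), (127, 128)]:
    print(a, b,
          length_discrete(b) - length_discrete(a),
          round(length_continuous(b) - length_continuous(a), 4))
# 126 127 0 0.0156
# 127 128 1 0.0155
```

The discrete length jumps by a whole bit when x + 1 crosses a power of two, whereas the continuous version changes smoothly, which is why small changes in lexicon size are easier to compare with the approximation.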



The lexicon lists words (represented as
phoneme sequences) paired with their code
words¹. For example:

Word    Code Word
ðə      [the]
kæt     [cat]
kɪti    [kitty]
si      [see]

In the binary representation, the two columns are 
represented separately, one after the other; the 
first column is called the word inventory col- 
umn; the second column is called the code word 
inventory column. 

In the word inventory column (see Figure 1a
for a schematic), the list of lexical items is represented
as a continuous string of phonemes, without
separators between words (e.g., ðəkætkɪtisi...).
To mark the boundaries between lexical items, the
phoneme string is preceded by a list of integers
representing the lengths (in phonemes) of each
word. Each length is represented as a fixed-length,
zero-padded binary number. Preceding this list is
a single integer denoting the length of each length
field; this integer is represented in unary, so that
its length need not be known in advance. Preceding
the entire column is the number of lexical
entries n coded as a self-delimiting integer.

The length of the representation of the integer
n is given by the function

$$\ell^{(2)}(n) \qquad (1)$$

We define $\mathrm{len}(w_i)$ to be the number of
phonemes in word $w_i$. If there are p total unique
phonemes used in the sample, then we represent
each phoneme as a fixed-length bit string of length
$\mathrm{len}(p) = \log_2 p$. So, the length of the representation
of a word $w_i$ in the lexicon is the number
of phonemes in the word times the length of a
phoneme: $\mathrm{len}(p) \cdot \mathrm{len}(w_i)$. The total length of all
the words in the lexicon is the sum of this formula
over all lexical items:

$$\sum_{i=1}^{n} \mathrm{len}(p) \cdot \mathrm{len}(w_i) = \mathrm{len}(p) \sum_{i=1}^{n} \mathrm{len}(w_i) \qquad (2)$$

As stated above, the length fields used to divide
the phoneme string are fixed-length. In each
field is an integer between one and the number of
phonemes in the longest word. Since representing
integers between one and x takes $\log_2 x$ bits, the
length of each field is:

$$\log_2\left(\max_{i=1 \ldots n} \mathrm{len}(w_i)\right)$$

¹Code words are represented by square brackets, so
[x] means 'the code word corresponding to x'.



Figure 1: Schematic diagrams for components of the representation: (a) word inventory column, (b) code word inventory column, (c) sample.

To be fully self-delimiting, the width of a field
must be represented in a self-delimiting way; we
use a unary representation, i.e., we write an extra
field consisting of only '1' bits followed by a terminating
'0'. There are n fields (one for each word),
plus the unary prefix, so the combined length of
the fields plus prefix (plus terminating zero) is:

$$1 + (n+1) \log_2\left(\max_{i=1 \ldots n} \mathrm{len}(w_i)\right) \qquad (3)$$

The total length of the word inventory column representation
is the sum of the terms in (1), (2) and (3).

The code word inventory column of the lexicon
(see Figure 1b for a schematic) has a representation
nearly identical to that of the previous column, except
that code words are listed instead of phonemic
words; the length fields and unary prefix serve
the same purpose of marking the divisions between
code words.

The sample can be represented most compactly
by assigning short code words to frequent
words, reserving longer code words for infrequent
words. To satisfy this property, code words are assigned
so that their lengths are frequency-based;
the length of the code word for a word of frequency
$f(w)$ will not be greater than:

$$\mathrm{len}([w]) = \log_2 \frac{m}{f(w)}$$

where m is the number of words in the sample.

The total length of the code word list is the sum
of the code word lengths over all lexical entries:

$$\sum_{i=1}^{n} \mathrm{len}([w_i]) = \sum_{i=1}^{n} \log_2 \frac{m}{f(w_i)} \qquad (4)$$


As in the word inventory column (described
above), the length of each code word is represented
in a fixed-length field. Since the least frequent
word will have the longest code word (a property
of the formula for $\mathrm{len}([w_i])$), the longest possible
code word comes from a word of frequency one:

$$\log_2 \frac{m}{1} = \log_2 m$$

Since the fields contain integers between one and
this number, we define the length of a field to be:

$$\log_2(\log_2 m)$$



As above, we represent the width of a field in
unary, so there are a total of n + 1 elements of
this size (n fields plus the unary representation of
the field width). The combined length of the fields
plus prefix (and terminating zero) is:

$$1 + (n+1) \log_2(\log_2 m) \qquad (5)$$

The total length of the code word inventory column
representation is the sum of the terms in (4)
and (5).

Finally, the sequence of words which form the
sample (see Figure 1c for a schematic) is represented
as the number of words in the sample (m)
followed by the list of code words. Since code
words are used as compact indices into the lexicon,
the original sample could be reconstructed
completely by looking up each code word in this
list and replacing it with its phoneme sequence
from the lexicon. The code words we assigned to
lexical items are self-delimiting (once the set of
codes is known), so there is no need to represent
the boundaries between code words.

The length of the representation of the integer
m is given by the function

$$\ell^{(2)}(m) \qquad (6)$$

The length of the representation of the sample
is computed by summing the lengths of the code
words used to represent the sample. We can simplify
this description by noting that the combined
length of all occurrences of a particular code word
$[w_i]$ is $f(w_i) \cdot \mathrm{len}([w_i])$, since there are $f(w_i)$ occurrences
of the code word in the sample. So, the
length of the encoded sample is the sum of this
formula over all words in the lexicon:

$$\sum_{i=1}^{n} f(w_i) \cdot \mathrm{len}([w_i]) = \sum_{i=1}^{n} f(w_i) \cdot \log_2 \frac{m}{f(w_i)} \qquad (7)$$

The total length of the sample is given by adding
the terms in (6) and (7). The total length of the
representation of the entire hypothesis is the sum
of the representation lengths of the word inventory
column, the code word inventory column and the
sample.
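The seven terms can be combined into a single size estimate. The sketch below is one possible reading of formulae (1) through (7), not the original implementation: the function names are hypothetical, each phoneme is assumed to be one character, code word lengths are taken to be log2(m / f(w)), and the sample is assumed to contain at least two word tokens so that log2(log2 m) is defined.

```python
import math
from collections import Counter


def int_length(x):
    """Continuous approximation of the self-delimiting integer length."""
    return 1.5 + math.log2(x + 1) + 2 * math.log2(math.log2(x + 2) + 0.5)


def description_length(segmentation):
    """Approximate size in bits of a segmentation hypothesis (needs m >= 2)."""
    tokens = [w for utterance in segmentation for w in utterance]
    freq = Counter(tokens)           # f(w) for every word type
    n, m = len(freq), len(tokens)    # lexicon size, sample size in words
    p = len(set("".join(tokens)))    # unique phoneme symbols in the sample
    # Word inventory column: terms (1), (2) and (3).
    words = (int_length(n)
             + math.log2(p) * sum(len(w) for w in freq)
             + 1 + (n + 1) * math.log2(max(len(w) for w in freq)))
    # Code word inventory column: terms (4) and (5).
    codes = (sum(math.log2(m / f) for f in freq.values())
             + 1 + (n + 1) * math.log2(math.log2(m)))
    # Sample: terms (6) and (7).
    sample = int_length(m) + sum(f * math.log2(m / f) for f in freq.values())
    return words + codes + sample


# Introductory example with ASCII stand-ins (an assumption):
# D = eth, @ = schwa, I = small capital i.
hypothesis_1 = [["du", "ju", "si", "D@", "kIti"],
                ["si", "D@", "kIti"],
                ["du", "ju", "laIk", "D@", "kIti"]]
hypothesis_2 = [["duj", "us", "iD", "@kIt", "i"],
                ["siD", "@k", "Iti"],
                ["du", "jul", "aIk", "D@k", "Iti"]]

print(description_length(hypothesis_1) < description_length(hypothesis_2))  # True
```

On the introductory example, the correct segmentation receives the shorter description because its lexicon is smaller and its words are reused more often.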



This system of computing hypothesis sizes is 
efficient in the sense that elements are thought 
of as being represented compactly and that code 
words are assigned based on the relative frequen- 
cies of words. The final evaluation given to a hy- 
pothesis is an estimate of the minimal number of 
bits required to transmit that hypothesis. As such, 
it permits direct comparison between competing 
hypotheses; that is, the shorter the representation 
of some hypothesis, the more distributional infor- 
mation can be extracted and, therefore, the better 
the hypothesis. 

Phonotactics 

Phonotactic knowledge was given to the system as
a list of licit initial and final consonant clusters of
English words²; this list was checked against all
six samples so that the list was maximally permissive
(e.g., the word-internal consonant cluster in
explore could be divided as ek-splore or eks-plore).
In those simulations which used the phonotactic
knowledge, a word boundary could not be inserted
when doing so would create a word-initial or word-final
consonant cluster not on the list or would create a
word without a vowel. For example (from an actual
sample, corresponding to the utterance "Want
me to help baby?"):

Sample:           wantmituhelpbebi
Valid Boundaries: want.mi.t.u.help.be.bi

In the second line, those word boundaries that are
phonotactically legal are marked with dots. The
boundary between /w/ and /a/ is illegal because
/w/ by itself is not a legal word in English; the
boundary between /a/ and /n/ is illegal because
/ntm/ is not a valid word-initial consonant cluster;
the boundary between /m/ and /i/ is illegal
because /ntm/ is also not a valid word-final consonant
cluster; the boundary between /p/ and /b/ is
legal because /lp/ is a valid word-final cluster and
/b/ is a valid word-initial cluster. Note that using
the phonotactic constraints reduces the number of
potential word boundaries from fifteen to six in
this example.
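The boundary test described above can be sketched as follows. The cluster sets here are toy assumptions chosen only to cover this one utterance; they are not the list used in the simulations, and the function name `valid_boundaries` is hypothetical.

```python
def valid_boundaries(utterance, vowels, licit_initial, licit_final):
    """Positions k where splitting utterance[:k] | utterance[k:] is licit."""
    def has_vowel(s):
        return any(ch in vowels for ch in s)

    valid = []
    for k in range(1, len(utterance)):
        left, right = utterance[:k], utterance[k:]
        # Every word must contain a vowel.
        if not (has_vowel(left) and has_vowel(right)):
            continue
        # Consonants between the last vowel of `left` and the boundary.
        coda = left[max(left.rfind(v) for v in vowels) + 1:]
        # Consonants between the boundary and the first vowel of `right`.
        first_vowel = min(i for i, ch in enumerate(right) if ch in vowels)
        onset = right[:first_vowel]
        if coda in licit_final and onset in licit_initial:
            valid.append(k)
    return valid


# Toy cluster lists (assumptions for illustration only); "" means no consonant.
LICIT_INITIAL = {"", "w", "m", "t", "h", "b"}
LICIT_FINAL = {"", "nt", "t", "lp"}

utterance = "wantmituhelpbebi"
points = valid_boundaries(utterance, "aiue", LICIT_INITIAL, LICIT_FINAL)
marked = "".join(("." if i in points else "") + ch
                 for i, ch in enumerate(utterance))
print(points)  # [4, 6, 7, 8, 12, 14]
print(marked)  # want.mi.t.u.help.be.bi
```

Each candidate point is tested in isolation against the unsegmented utterance, which matches how the valid boundaries are marked in the example line.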

After the system inserts a new word boundary,
it updates the list of remaining valid insertion
points: adding a point may cause nearby points
to become unusable due to the restriction that every
word must have a vowel. For example (corresponding
to the utterance "green and"):



²In phonological terms, the syllable onsets permitted
at word beginnings and syllable codas permitted at
word ends. Some languages (including English) have
different sets of onsets and codas for word-internal and
word-boundary positions; we use the word-boundary
set.



Before: gri.n.ænd
After:  grin
        ænd

After the segmentation of /grin/ and /ænd/, the
potential boundary between /i/ and /n/ becomes
invalid because inserting a word boundary there
would produce a word with no vowel (/n/).
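This update step can be illustrated with a small sketch. The function name `still_valid` is hypothetical, and the ASCII letter A stands in for the vowel of "and" (an assumption); only the every-word-needs-a-vowel restriction is checked here.

```python
def still_valid(utterance, inserted, candidates, vowels):
    """Candidate points that leave every resulting word with a vowel."""
    remaining = []
    for k in candidates:
        cuts = sorted(set(inserted) | {k})
        starts = [0] + cuts
        ends = cuts + [len(utterance)]
        pieces = [utterance[a:b] for a, b in zip(starts, ends)]
        if all(any(ch in vowels for ch in piece) for piece in pieces):
            remaining.append(k)
    return remaining


utterance = "grinAnd"  # "green and"; A is a stand-in for the vowel of "and"
print(still_valid(utterance, [], [3, 4], "iA"))   # [3, 4]  (gri.n.And)
print(still_valid(utterance, [4], [3], "iA"))     # []      (/n/ would lack a vowel)
```

Before any boundary is inserted, both points are usable; once the boundary after /grin/ is in place, the point between /i/ and /n/ drops out.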

Inputs and Simulations 

Two speech samples from each of three subjects 
were used in the simulations; in one sample a 
mother was speaking to her daughter and in 
the other, the same mother was speaking to the 
researcher. The samples were taken from the 
CHILDES database (MacWhinney & Snow, 1990) 
from studies reported in Bernstein (1982). Each 
sample was checked for consistent word spellings 
(e.g., 'ts was changed to its), then was transcribed
into an ASCII-based phonemic representation³.
The transcription system was based on IPA and 
used one character for each consonant or vowel; 
diphthongs, r-colored vowels and syllabic conso- 
nants were each represented as one character. For 
example, "boy" was written as b7, "bird" as bRd 
and "label" as lebL. For purposes of phonotac- 
tic constraints, syllabic consonants were treated as 
vowels. Sample lengths were selected to make the 
number of available segmentation points nearly 
equal (about 1,350) when no phonotactic con- 
straints were applied; child-directed samples had 
498-536 tokens and 153-166 types, adult-directed 
samples had 443-484 tokens and 196-205 types. 
Finally, before the samples were fed to the simu- 
lations, divisions between words (but not between 
sentences) were removed. 

The space of possible hypotheses is vast⁴, so
some method of finding a minimum-length hypothesis
without considering all hypotheses is necessary.
We used the following method: first, evaluate
the input sample with no segmentation points
added; then evaluate all hypotheses obtained by
adding one or two segmentation points; take the
shortest hypothesis found in the previous step and
evaluate all hypotheses obtained by adding one
or two more segmentation points; continue this
way until the sample has been segmented into the
smallest possible units and report the shortest hypothesis
ever found. Two variants of this simulation
were used: (1) Dist-Free was free of any
phonotactic restrictions on the hypotheses it could
form (Dist refers to the measurement of distributional
information), whereas (2) Dist-Phono
used the phonotactic restrictions described above.

³The transcription method ensured the identical
transcription of all occurrences of a word.

⁴For our samples, unconstrained by phonotactics,
there are about $2^{1350} \approx 2.5 \times 10^{406}$ hypotheses.



Each simulation was run on each sample, for a total
of twelve Dist runs.
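The search procedure can be sketched generically as follows. This is an illustrative reading of the method, not the original code: `greedy_search` and the `cost` callable are hypothetical names, the cost would in practice be the hypothesis size in bits, and the sketch ignores the updating of valid points under phonotactic constraints.

```python
from itertools import combinations


def greedy_search(candidates, cost):
    """Add one or two points at a time; report the cheapest hypothesis seen."""
    current = frozenset()
    best, best_cost = current, cost(current)
    remaining = set(candidates)
    while remaining:
        pool = sorted(remaining)
        # All hypotheses reachable by adding one or two segmentation points.
        options = [current | {p} for p in pool]
        options += [current | {p, q} for p, q in combinations(pool, 2)]
        # Continue from the shortest of them, even if it is worse than `best`.
        current = min(options, key=cost)
        remaining -= current
        current_cost = cost(current)
        if current_cost < best_cost:
            best, best_cost = current, current_cost
    return best, best_cost


# Toy cost (an assumption): distance from a known target segmentation.
target = {2, 5}
best, best_cost = greedy_search(range(1, 7), lambda h: len(h ^ target))
print(sorted(best), best_cost)  # [2, 5] 0
```

Because the search keeps going until no points remain and only remembers the best hypothesis seen, it can pass through worse hypotheses without losing the minimum found earlier.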

Finally, two other simulations were run on
each sample to measure chance performance:
(1) Rand-Free inserted random segmentation
points and reported the resulting hypothesis;
(2) Rand-Phono inserted random segmentation
points where permitted by the phonotactic constraints.
Since the Rand simulations were given
the number of segmentation points to add (equal
to the number of segmentation points needed to
produce the natural English segmentation), their
performance is an upper bound on chance performance.
In contrast, the Dist simulations must
determine the number of segmentation points to
add using MDL evaluations. The results for each
Rand simulation are averages over 1,000 trials on
each input sample.
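A chance baseline of this kind might be sketched as below. The function names and the seeding scheme are assumptions; for a Rand-Phono style run, `candidates` would hold only the phonotactically permitted points, and this sketch does not model the vowel restriction that can invalidate neighbouring points.

```python
import random


def random_segmentation(candidates, k, rng):
    """Pick k distinct segmentation points uniformly at random."""
    return sorted(rng.sample(sorted(candidates), k))


def mean_hits(candidates, correct, trials=1000, seed=0):
    """Average number of correct points found over repeated random trials."""
    rng = random.Random(seed)
    correct = set(correct)
    total = 0
    for _ in range(trials):
        guess = random_segmentation(candidates, len(correct), rng)
        total += len(correct & set(guess))
    return total / trials


# Toy usage: 15 candidate points, of which 5 are correct.
print(mean_hits(range(1, 16), [4, 6, 8, 12, 14]))
```

Giving the baseline the true number of boundaries is what makes it an upper bound on chance: it never over- or under-segments.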

RESULTS 

Each simulation was scored for the number of correct
segmentation points inserted, as compared to
the natural English segmentation. From this scoring,
two values were computed: recall, the percentage
of all correct segmentation points that were
actually found; and accuracy, the percentage of the
hypothesized segmentation points that were actually
correct. In terms of hits, false alarms and
misses, we have:

$$\mathrm{recall} = \frac{\mathrm{hits}}{\mathrm{hits} + \mathrm{misses}} \qquad \mathrm{accuracy} = \frac{\mathrm{hits}}{\mathrm{hits} + \mathrm{false\ alarms}}$$
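As a concrete illustration of the two measures, the following sketch scores a set of hypothesized segmentation points against the correct ones. The function name `score` and the toy point sets are hypothetical.

```python
def score(hypothesized, correct):
    """Return (recall, accuracy) as percentages for two sets of points."""
    hypothesized, correct = set(hypothesized), set(correct)
    hits = len(hypothesized & correct)
    misses = len(correct - hypothesized)
    false_alarms = len(hypothesized - correct)
    recall = 100.0 * hits / (hits + misses) if correct else 0.0
    accuracy = 100.0 * hits / (hits + false_alarms) if hypothesized else 0.0
    return recall, accuracy


# Toy example: 2 hits, 2 misses, 1 false alarm.
recall, accuracy = score({2, 4, 9}, {2, 4, 6, 8})
print(recall)               # 50.0
print(round(accuracy, 1))   # 66.7
```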



Results are given in Table 1. Note that there
is a trade-off between recall and accuracy: if all
possible segmentation points were added, recall
would be 100% but accuracy would be low; likewise,
if only one segmentation point was added
between two words, accuracy would be 100% but
recall would be low. Since our goal is to correctly
segment speech, accuracy is more important than
finding every correct segmentation. For example,
deciding 'littlekitty' is a word is less disastrous
than deciding 'li', 'tle', 'ki' and 'ty' are all words,
because assigning meaning to 'littlekitty' is a reasonable
first try at learning word-meaning pairs,
whereas trying to assign separate meanings to 'li'
and 'tle' is problematic.

The performance of Dist-Phono on child-directed
speech shows that this system goes a long
way toward solving the segmentation problem.
However, comparing the average performances of
simulations is also useful. The effect of phonotactic
information can be seen by comparing the
average performances of Rand-Free and Rand-Phono,
since the only difference between them
is the addition of phonotactic constraints on segmentations
in the latter. Clearly phonotactic constraints
are useful, as both recall and accuracy improve.
A similar comparison between Rand-Free
and Dist-Free shows that distributional information
alone also improves performance. Note in
all the results of Dist-Free that using distributional
information alone favors recall over accuracy;
in fact, the segmentation hypotheses produced
by Dist-Free have most words broken into
single phoneme units with only a handful of words
remaining intact. Two comparisons are needed
to show that the combination of distributional
and phonotactic information performs better than
either source alone: Dist-Phono compared to
Rand-Phono, to see the effect of adding distributional
analysis to phonotactic constraints, and
Dist-Phono compared to Dist-Free, to see the
effect of adding phonotactic constraints to distributional
analysis. The former comparison shows
that the sources combined are more useful than
phonotactic information alone. The latter comparison
is less obvious: the trade-off between recall
and accuracy seems to have reversed, with
no clear winner⁵. Data on discovered word types
help make this comparison: Dist-Free found
12% of the words with 30% accuracy and Dist-Phono
found 33% of the words with 50% accuracy.
Whereas the segmentation point data are inconclusive,
word type data demonstrate that combining
information sources is more useful than using
distributional information alone.

There is no obvious difference in performance 
between child- and adult-directed speech, except 
in Dist-Phono (combined information sources) 
in which the difference is striking: accuracy re- 
mains high and recall rate more than triples for 
child-directed speech. This difference is again sup- 
ported by word type data: 14% recall with 30% ac- 
curacy for adult-directed speech, 56% recall with 
65% accuracy for child-directed speech. 

DISCUSSION 

Our technique segments continuous speech into 
words using only distributional and phonotac- 
tic information more effectively than one might 
expect — up to 66% recall of segmentation points 
with 92% accuracy on one sample, which yields 
58% recall of word types with 67% accuracy (the 
relatively low type accuracy is mitigated by the 
fact that most incorrect words are meaningful 
concatenations of correct words — e.g., 'thekitty'). 



⁵The higher accuracy of Dist-Phono is a good
sign. Furthermore, the minimum of the recall/accuracy
pair is greater in Dist-Phono than in Dist-Free
and the maximum of the recall/accuracy pair is also
greater in Dist-Phono than in Dist-Free.



Table 1: Results for all simulations averaged over individual speech samples

                                     Simulation
Target   Measure      Rand-Free  Rand-Phono  Dist-Free  Dist-Phono
Adult    % Recall        25.1       39.5        95.5       22.5
         % Accuracy      28.9       50.5        36.0       92.0
Child    % Recall        23.4       40.2        79.9       72.3
         % Accuracy      26.7       51.7        37.4       88.3
Average  % Recall        24.3       39.9        88.0       46.4
         % Accuracy      27.8       51.1        36.6       89.2


This finding confirms the idea that distribution 
and phonotactics are useful sources of information 
that infants might use in discovering words (e.g., 
Jusczyk et al., 1993b). In fact, it helps explain infants'
ability to learn words from parental speech:
these two sources alone are useful and infants have 
several others, like prosody and word stress pat- 
terns, available as well. It also suggests that se- 
mantics and isolated words need not play as cen- 
tral a role as one might think (e.g., Jusczyk, 1993, 
downplayed the utility of words in isolation). It is 
difficult, if not impossible given currently available 
methods, to determine which sources of informa- 
tion are necessary for infants to segment speech 
and learn words; only this sort of indirect evidence 
is available to us. 

The results show a difference between adult- 
and child-directed speech, in that the latter is eas- 
ier to segment given both distribution and phono- 
tactics. This lends quantitative support to re- 
search which suggests that motherese differs from 
normal adult speech in ways possibly useful to the 
language-learning infant (Aslin et al.). In fact, the 
factors making motherese more learnable might be 
elucidated using this technique: compare the re- 
sults of several different models, each containing 
a different factor or combination of factors, look- 
ing for those in which a substantial performance 
difference exists between child- and adult-directed
speech. 

Our model uses phonotactic constraints as ab- 
solute requirements on the structure of individual 
words; this implies that phonotactics have been 
learned prior to attempts at segmentation. We 
must therefore show that phonotactics can indeed 
be learned without access to a lexicon — without 
such a demonstration, we are trapped in circular
reasoning. Gafos and Brent (1994) demonstrate
that phonotactics can be learned with high accuracy
from the same unsegmented utterances we
used in our simulations. In general, two methods
exist for combining information sources in the
MDL paradigm: one is to have absolute require- 
ments on plausible hypotheses (like our phonotac- 
tic constraints) — these requirements must be in- 
dependently learnable; the other method of com- 
bination is to include an information source in the 
internal representation of hypotheses (like our dis- 
tributional information) — all components of the 
representation are learned simultaneously (see El- 
lison, 1992, for an example of multiple components 
in a representation). 

We would like to extend the system by using 
a more detailed transcription system. We expect 
that this would help the system find word bound- 
aries for reasons detailed in Church (1987) — in 
brief, that allophonic variation may be quite useful
in predicting word boundaries. Another, simpler
extension of this research will be to increase
the length of the speech samples used. Finally, we 
will try the current system on samples from other 
languages, to make sure this method generalizes 
appropriately. 

This research program will provide comple- 
mentary evidence supporting hypotheses about 
the sources of information infants use in learning 
their native languages. Until now, research has fo- 
cused on demonstrations of infants' sensitivity to 
various sources; we have begun to provide quanti- 
tative measures of the usefulness of those sources. 